The data set baseball-2016.xlsx contains season statistics for baseball teams in the USA in 2016, such as:
Games won, Games lost, Runs per game, At bats, Runs, Hits, Doubles, Triples, Home runs, Runs batted in, Bases stolen, Times caught stealing, Bases on balls, Strikeouts, Hits/At bats, On-base percentage, Slugging percentage, On-base plus slugging, Total bases, Double plays grounded into, Times hit by pitch, Sacrifice hits, Sacrifice flies, Intentional bases on balls, and Runners left on base.
The dataset consists of 30 unique baseball teams, with statistics on their performance in the two leagues (NL and AL). In this case it is reasonable to apply multidimensional scaling (MDS), or otherwise reduce the dimensionality of the team vectors, to obtain a more digestible representation in which the teams can be compared and their closeness to one another assessed.
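Because the variables are measured on very different scales (counts in the thousands next to averages below 1), distances between teams are usually computed on standardized columns. The following is a minimal sketch in plain Python, not the code behind this report; the three teams and their values are made up for illustration, and `standardize` and `distance_matrix` are hypothetical helper names.

```python
import math
from statistics import mean, stdev

def standardize(rows):
    """Scale every column to mean 0 and sample standard deviation 1.

    Assumes no column is constant (a zero standard deviation would
    cause a division by zero).
    """
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def distance_matrix(rows):
    """Pairwise Euclidean distances between the rows."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Made-up values for three teams: (runs, at bats, batting average).
teams = [
    [700.0, 5500.0, 0.250],
    [750.0, 5600.0, 0.262],
    [650.0, 5450.0, 0.245],
]

scaled = standardize(teams)
dist = distance_matrix(scaled)  # this matrix is what an MDS routine would take as input
```

Without the scaling step, the at-bats column would dominate every distance simply because its raw values are the largest.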
## [1] "Dimensions of the dataset:"
## [1] 30 28
There seems to be a difference between the leagues, but it is not clear-cut. Roughly two thirds of the AL teams lie on the positive side of the V2 axis, while the NL teams are more spread out along this axis. So there may be some differences between the two leagues, but they are not pronounced.
V2 is the component that appears to differentiate the leagues best, as noted above.
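The share of a league's teams on the positive side of V2 can be checked directly from the MDS coordinates. This is a small sketch in plain Python under assumed inputs: the coordinates and league labels below are invented, and `share_positive_v2` is a hypothetical helper, not part of the original analysis.

```python
def share_positive_v2(coords, leagues, league):
    """Fraction of teams in `league` whose second MDS coordinate is positive."""
    v2 = [point[1] for point, lg in zip(coords, leagues) if lg == league]
    return sum(v > 0 for v in v2) / len(v2)

# Invented (V1, V2) coordinates for six teams.
coords = [(0.4, 1.2), (-0.9, 0.3), (1.1, -0.5), (0.2, -1.4), (-0.6, 0.8), (0.7, -0.2)]
leagues = ["AL", "AL", "AL", "NL", "NL", "NL"]

al_share = share_positive_v2(coords, leagues, "AL")  # 2 of the 3 AL teams have V2 > 0
nl_share = share_positive_v2(coords, leagues, "NL")  # 1 of the 3 NL teams has V2 > 0
```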
## initial value 19.856833
## iter 5 value 16.319153
## iter 10 value 16.046215
## final value 15.935476
## converged
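If the trace above comes from `MASS::isoMDS` (an assumption based on its format), the reported values are Kruskal's stress in percent, so lower is better. The sketch below, in plain Python rather than R, shows how such a value can be computed from the pairwise dissimilarities and the distances in the fitted configuration; `pava` and `kruskal_stress_percent` are hypothetical names, ties in the dissimilarities are not treated specially, and the three pairs are made up.

```python
import math

def pava(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block holds [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted

def kruskal_stress_percent(dissimilarities, config_distances):
    """Kruskal stress-1 in percent; both lists hold one value per pair, same order."""
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    d = [config_distances[i] for i in order]
    d_hat = pava(d)  # disparities: monotone fit of distances to the dissimilarity order
    num = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    den = sum(a ** 2 for a in d)
    return 100.0 * math.sqrt(num / den)

# Distances that preserve the rank order of the dissimilarities give zero stress.
perfect = kruskal_stress_percent([1.0, 2.0, 3.0], [0.5, 0.7, 2.0])
# Swapping the order of the last two distances introduces stress (about 18.9).
imperfect = kruskal_stress_percent([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
```

Because only the rank order matters, the stress is zero whenever the configuration distances increase with the dissimilarities, regardless of their absolute size.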
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
The teams that seem to be outliers are:
The MDS fit looks average, given that a perfect encoding of the dissimilarities would yield a one-to-one relationship between the original dissimilarities and the distances in the MDS configuration. A few observations look like outliers that were difficult for the MDS algorithm to map. Some of those hard-to-map points are presented below.
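One simple way to flag hard-to-map observations is to sum, for each team, the squared gaps between its original dissimilarities and its distances in the MDS configuration. The sketch below uses plain Python and made-up 3x3 matrices; `point_misfit` is a hypothetical helper, and comparing raw differences assumes the two matrices are on a comparable scale, which does not hold automatically for non-metric MDS.

```python
def point_misfit(dissim, conf_dist):
    """Per-point sum of squared gaps between dissimilarities and configuration distances."""
    n = len(dissim)
    return [
        sum((dissim[i][j] - conf_dist[i][j]) ** 2 for j in range(n) if j != i)
        for i in range(n)
    ]

# Made-up symmetric matrices for three observations.
dissim = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
conf_dist = [
    [0.0, 1.0, 2.5],
    [1.0, 0.0, 3.0],
    [2.5, 3.0, 0.0],
]

misfit = point_misfit(dissim, conf_dist)  # [0.25, 4.0, 4.25]
worst = max(range(len(misfit)), key=lambda i: misfit[i])  # index of the hardest point
```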
The two variables that best separate the leagues are SH and IBB. Both count how often a particular play is made: SH is sacrifice hits and IBB is intentional bases on balls. Although both are recorded as batting statistics, they reflect cautious tactics (a batter giving up an out to advance a runner, or a pitcher deliberately walking a batter), which suggests that the V2 variable obtained from the MDS may be related in some way to defensive play. Which direction along V2 corresponds to more defensive play depends on whether the NL is more defensive than the AL.
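The report does not state how the separating variables were chosen. One plausible criterion, sketched here in plain Python, is to rank each variable by the absolute difference in league means divided by the pooled standard deviation. The values below are invented, and `separation` is a hypothetical helper rather than the method actually used.

```python
import math
from statistics import mean, stdev

def separation(values, leagues, a="AL", b="NL"):
    """Absolute difference in group means, in units of the pooled standard deviation."""
    xa = [v for v, lg in zip(values, leagues) if lg == a]
    xb = [v for v, lg in zip(values, leagues) if lg == b]
    pooled_var = ((len(xa) - 1) * stdev(xa) ** 2 + (len(xb) - 1) * stdev(xb) ** 2) / (
        len(xa) + len(xb) - 2
    )
    return abs(mean(xa) - mean(xb)) / math.sqrt(pooled_var)

# Invented values for six teams; not taken from baseball-2016.xlsx.
leagues = ["AL", "AL", "AL", "NL", "NL", "NL"]
variables = {
    "SH": [20, 25, 30, 50, 55, 60],
    "HR": [200, 180, 220, 190, 210, 205],
}

scores = {name: separation(vals, leagues) for name, vals in variables.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # best-separating variable first
```

In this toy data SH separates the leagues far better than HR, since its league means are six pooled standard deviations apart.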